Scale command / Z Distribution

The main purpose here is to show how the scale command works on a dataset. In statistics it is often important to transform the data to the Z-scale before analyzing them, which standardizes each variable to mean 0 and standard deviation 1. When you have data of different magnitudes (kilometers, seconds, temperature, etc.), it makes sense to convert them to a common scale, which makes the analysis easier and helps when feeding the data into some algorithms. Graphically, I'm showing a randomly generated dataset that is scaled gradually, in steps of 10%. Initially the X axis has values between -5 and 5 and the Y axis has values between -25 and 25. At the end of the scaling procedure, both X and Y will have values roughly between -2 and 2, without losing the proportionality within each variable.
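
As a minimal sketch of what the scale command computes for a single variable: with its default arguments, each value has the column mean subtracted and is then divided by the sample standard deviation. The vector km below is made up for illustration and is not part of the dataset used in this post.

```r
# Illustrative values only (hypothetical, not from the post's dataset)
km <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Z-score by hand: centre on the mean, divide by the sample standard deviation
z.manual <- (km - mean(km)) / sd(km)

# scale() returns a one-column matrix with attributes;
# as.numeric() reduces it to a plain vector for comparison
z.scale <- as.numeric(scale(km))

# Both approaches should agree up to floating-point tolerance
all.equal(z.manual, z.scale)
```

The scaled values are unitless, which is what allows variables measured in different units to be compared on the same axes.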

# Packages used for the plot and the animation
library(ggplot2)
library(plotly)

# Generating random dataset
x <- runif(1000,  -5, 5)
y <- runif(1000, -25, 25)
z <- runif(1000,   1, 2)

# Combining the vectors into a data frame
sim.data       <- data.frame(x, y, z)

# Creating a complete data frame holding every scaling stage (pscale = 0 is the raw data)
sim.data.total        <- sim.data
sim.data.total$pscale <- 0

# Steps of 10%
ntry <- 1:10
sim.part <- NULL
for (i in ntry) {
  
  # Rows of the current 10% block (100 rows per step)
  end.r        <- 100 * i
  ini.r        <- end.r - 99
  # Scale this block on its own (using the block's mean and sd) and append it
  sim.part     <- data.frame(rbind(sim.part, scale(sim.data[ini.r:end.r,])))
  
  if (end.r < nrow(sim.data)) {
    # Intermediate step: scaled blocks so far plus the still-unscaled remainder
    sim.data.scl <- data.frame(rbind(sim.part, sim.data[(end.r+1):nrow(sim.data),]))
  } else {
    # Last step (end.r == nrow(sim.data)): scale the full dataset at once
    sim.data.scl <- data.frame(scale(sim.data))
  }
  
  sim.data.total <- rbind(sim.data.total, data.frame(x=sim.data.scl$x, y=sim.data.scl$y, z=sim.data.scl$z, pscale=i*10))
    
}
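# Store the ggplot object in p; the frame aesthetic is not used by ggplot2 itself
# (hence the warning below) but is picked up by ggplotly for the animation
p <- ggplot(sim.data.total, aes(x=x, y=y, color=z)) + geom_point(aes(frame = pscale)) + ggtitle("Scaling the Dataset") + theme(plot.title = element_text(hjust = -5, vjust=0))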
## Warning: Ignoring unknown aesthetics: frame
# Animate the ggplot object with ggplotly
ggplotly(p) %>% animation_opts(1000) %>% animation_slider(currentvalue = list(prefix = "Scale ", suffix = "%", font = list(size=12, color="red")))
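
To check the claim that both axes end up on the same scale, a sketch along these lines can be run separately. It regenerates a comparable dataset under the name chk (a name introduced here for illustration) and fixes a seed, which is an assumption added for repeatability and is not in the code above.

```r
# Seed added only so the sketch is repeatable (assumption, not in the original code)
set.seed(42)

# A dataset with the same ranges as x and y in the post
chk <- data.frame(x = runif(1000, -5, 5), y = runif(1000, -25, 25))

# Scale every column to mean 0 and standard deviation 1
chk.scl <- data.frame(scale(chk))

colMeans(chk.scl)        # both close to 0
sapply(chk.scl, sd)      # both equal to 1 up to rounding
sapply(chk.scl, range)   # minimum and maximum of each scaled column
```

For a uniform variable the scaled values are bounded by about the square root of 3 (roughly 1.73) in either direction, which is why the points settle inside the interval from -2 to 2 regardless of the original range.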

Session info

For reproducibility purposes it is always a good idea to capture the state of the environment that was used to generate the results:

sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252   
## [3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C                      
## [5] LC_TIME=Portuguese_Brazil.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] plotly_4.8.0  ggplot2_3.0.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17      later_0.7.3       pillar_1.3.0     
##  [4] compiler_3.4.4    plyr_1.8.4        bindr_0.1.1      
##  [7] tools_3.4.4       digest_0.6.15     viridisLite_0.3.0
## [10] jsonlite_1.5      evaluate_0.11     tibble_1.4.2     
## [13] gtable_0.2.0      pkgconfig_2.0.1   rlang_0.2.1      
## [16] shiny_1.1.0       crosstalk_1.0.0   yaml_2.1.19      
## [19] bindrcpp_0.2.2    withr_2.1.2       dplyr_0.7.6      
## [22] stringr_1.3.1     httr_1.3.1        knitr_1.20       
## [25] htmlwidgets_1.2   rprojroot_1.3-2   grid_3.4.4       
## [28] tidyselect_0.2.4  glue_1.2.0        data.table_1.11.4
## [31] R6_2.2.2          rmarkdown_1.10    tidyr_0.8.1      
## [34] purrr_0.2.5       magrittr_1.5      promises_1.0.1   
## [37] backports_1.1.2   scales_0.5.0      htmltools_0.3.6  
## [40] assertthat_0.2.0  xtable_1.8-2      mime_0.5         
## [43] colorspace_1.3-2  httpuv_1.4.4.2    labeling_0.3     
## [46] stringi_1.1.7     lazyeval_0.2.1    munsell_0.5.0    
## [49] crayon_1.3.4